NLP and Topic Modeling

In this notebook, I do some NLP, topic modeling, sentiment analyses and visualizations to help me better understand the overall dataset. I will apply the topic modeling techniques only to 5-star rated reviews because I want to know what consumers are liking/loving. This also helps computationally given the limiter capabilities of my laptop. In addition, none of the questions in this take-home assignment were about a brand's poor performance or how to improve things. Therefore, this serves as another reason why I have excluded reviews that are rated below 5-stars.

Note: I personally have not worked with topic modeling outside of a classroom assignment context. Therefore the application of topic modeling to real world data such as this is new to me. I had to do a bit of research to find ways of implementing topic modeling and making sense of it. Therefore, several codeblocks in this notebook have been adapted from external sources which I will link in a resources document in the current directory.

The contents of this notebook may take up to several hours to run

In [6]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS

from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline

from sklearn.decomposition import LatentDirichletAllocation

from gensim import corpora, models
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import sys
import re, numpy as np, pandas as pd
from pprint import pprint
from bs4 import BeautifulSoup 

%matplotlib inline
import warnings
import logging
warnings.filterwarnings("ignore",category=DeprecationWarning)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)


np.random.seed(42)
In [3]:
reviews = pd.read_csv('../data/reviews.csv')

1. Sentiment

I make the assumption that review sentiment should be positively correlated with star rating. To test this assumption

  1. create a column with sentiment sores for each review
  2. append the column to dataframe
  3. plot rating vs sentiment

Create a sentiment feature

In [8]:
# merge review and title columns together 
reviews['review_text'] = [''.join(i) for i in zip(reviews["title"].map(str),reviews["review"])]

# drop original review and title columns 
reviews.drop(columns = ['title', 'review'], inplace = True)
In [9]:
# define a function that accepts text and returns the polarity
def detect_sentiment(review):
    #return TextBlob(text.decode('utf-8')).sentiment.polarity
    return TextBlob(review).sentiment.polarity
In [10]:
# create a new DataFrame column for sentiment (WARNING: SLOW!)
reviews['sentiment'] = reviews['review_text'].apply(detect_sentiment)
In [11]:
# box plot of sentiment grouped by star ratings (WARNING: 0 RATING MEANS THERE WAS NO RATING FOR THIS REVIEW)
reviews.boxplot(column='sentiment', by='rating');

As expected, the sentiment score increases with start rating.

The commented out code below is for text preprocessing in case TFIDF is needed for further modeling modeling

In [72]:
# def posts_to_words(review_text):
#     # Function to convert a raw review to a string of words
#     # The input is a single string (a customer review), and 
#     # the output is a single string (a preprocessed customer review)

#     # remove non-letters.
#     letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    
#     # convert to lower case, split into individual words.
#     words = letters_only.lower().split()
    
#     # define stop words
#     stop_words = stopwords.words('english')
 
#     # remove stop words.
#     meaningful_words = [w for w in words if not w in stop_words]
    
# #     # apply stemming to words to bring them to their root
# #     p_stemmer = PorterStemmer()
    
# #     # stem tokens
# #     stem_spam = [p_stemmer.stem.stem(i) for i in meaningful_words]
#     lemmatizer = WordNetLemmatizer()
#     lemmed_words = [lemmatizer.lemmatize(i) for i in meaningful_words]
    
#     return(" ".join(lemmed_words))
In [73]:
# def clean_posts(data):

#     print("Cleaning and parsing posts...")

#     j = 0
#     for post in data:
#         # Convert review to words, then append to clean_train_reviews.
#         clean.append(posts_to_words(post))
    
#         # If the index is divisible by 1000, print a message
#         if (j + 1) % 1000 == 0:
#             print(f'Review {j + 1} of {total_reviews}.')
        
#         j += 1
#     return clean
In [74]:
# # Get the number of posts based on the depanx dataframe size.
# total_reviews = reviews.shape[0]
# print(f'There are {total_reviews} reviews.')

# # Initialize an empty list to hold the clean posts.
# clean = []
# #clean_test_posts = []
There are 222049 reviews.
In [16]:
#clean_reviews = clean_posts(reviews['review_text'])
In [77]:
# reviews['review'] = clean_reviews
In [17]:
# # create a document-term matrix using TF-IDF
# vect = TfidfVectorizer(stop_words=stop_words, max_features = 60000, min_df = 10)
# dtm = vect.fit_transform(reviews.review)
# features = vect.get_feature_names()
# dtm.shape

2. Natural Language Processing (NLP)

Subset the data to include only 5-star rated reviews

In [22]:
reviews_5 = reviews[reviews['rating']==5]
reviews.shape
Out[22]:
(222049, 15)

Generate a bigram model, and trigram model with gensim

gensim allows to automatically detect common phrases comprised of multiple word expressions. Often times n-grams give more contextual info than single words.

The following several blocks of code take several hours to run

In [26]:
# Build the bigram and trigram models   # WARNING: TOOK ABOUT 30 MINS TO RUN 
bigram = gensim.models.Phrases(reviews_5['review_text'], min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[reviews_5['review_text']], threshold=100)  
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

Define stopwords

In [72]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'not', 'would', 'say', 'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'done', 'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'even', 'also', 'may', 'take', 'come', 'im','ive','dont','hes','got', 'wa', 'ha', 'amazing', 'love', 'like', 'wonderful', 'great', 'really', 'very', 'obsessed', 'ever', 'every', 'never', 'awesome', 'super', 'makes', 'feels', 'good', 'absolutely', 'especially', 'honestly', 'specifically', 'generally', 'definitely','i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't", 'from', 'subject', 're', 'edu', 'use', 'not', 'would', 'say', 'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'done', 'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'even', 'also', 'may', 'take', 'come', 'im', 'ive', 'dont', 'hes', 'got', 'wa', 'ha', 'amazing', 'love', 'like', 'wonderful', 'great', 'really', 'very', 'obsessed', 'ever', 'every', 'never', 'awesome', 'super', 'makes', 'feels', 'good', 'from', 'subject', 're', 'edu', 'use', 'not', 'would', 'say', 'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'done', 'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'even', 'also', 'may', 'take', 'come', 'im', 'ive', 'dont', 'hes', 'got', 'wa', 'ha', 'amazing', 'love', 'like', 'wonderful', 'great', 'really', 'very', 'obsessed', 'ever', 'every', 'never', 'awesome', 'super', 'makes', 'feels', 'good', 'from', 'subject', 're', 'edu', 'use', 'not', 'would', 'say', 'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'done', 'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'even', 'also', 'may', 'take', 'come', 'im', 'ive', 'dont', 'hes', 'got', 'wa', 'ha', 'amazing', 'love', 'like', 'wonderful', 'great', 'really', 'very', 'obsessed', 'ever', 'every', 'never', 'awesome', 'super', 'makes', 'feels', 'good', 'definitely', 'especially', 'honestly'])

stop_words = list(set(stop_words))

Process the data: remove stopwords, create bigrams, trigrams and lemmatize the text

In [ ]:
def process_words(texts, stop_words=stop_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Remove stopwords, form bigrams, trigrams and lemmatization"""
    texts = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
    texts = [bigram_mod[doc] for doc in texts]
    texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
    texts_out = []
    nlp = spacy.load('en', disable=['parser', 'ner'])
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    # remove stopwords once more after lemmatization
    texts_out = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts_out]    
    return texts_out

data_ready = process_words(reviews_5['review_text'])  

3. Topic Modeling

Transform the preprocessed text data into a corpus appropriate for topic modeling

In [27]:
# Create Dictionary
id2word = corpora.Dictionary(data_ready)

# Create Corpus: term document frequency
corpus = [id2word.doc2bow(text) for text in data_ready]

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=5, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=20,
                                           passes=20,
                                           alpha='symmetric',
                                           iterations=100,
                                           per_word_topics=True)
lda_model.print_topics()
Out[27]:
[(0,
  '0.084*"skin" + 0.053*"product" + 0.023*"face" + 0.017*"feel" + 0.016*"dry" + 0.016*"eye" + 0.014*"cream" + 0.014*"smooth" + 0.012*"look" + 0.012*"glow"'),
 (1,
  '0.036*"product" + 0.018*"buy" + 0.017*"look" + 0.017*"beautiful" + 0.016*"little" + 0.016*"well" + 0.015*"makeup" + 0.014*"work" + 0.014*"week" + 0.014*"apply"'),
 (2,
  '0.047*"receive" + 0.040*"smell" + 0.032*"soft" + 0.029*"dry" + 0.029*"feel" + 0.029*"way" + 0.026*"clean" + 0.020*"leave" + 0.019*"far" + 0.019*"lash"'),
 (3,
  '0.072*"hair" + 0.037*"color" + 0.029*"look" + 0.022*"lip" + 0.018*"brush" + 0.016*"absolutely" + 0.016*"perfect" + 0.015*"palette" + 0.014*"shade" + 0.013*"eye"'),
 (4,
  '0.044*"scent" + 0.040*"time" + 0.037*"smell" + 0.035*"much" + 0.020*"new" + 0.019*"last" + 0.019*"fine" + 0.017*"wear" + 0.017*"price" + 0.017*"fragrance"')]

Create a dataframe containing info on the topics, keywords for each topic and the preprocessed review text

In [98]:
def format_topics_sentences(ldamodel=None, corpus=corpus, texts=data_ready):
    # Init output
    sent_topics_df = pd.DataFrame()

    # Get main topic in each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list            
        # print(row)
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['dominant_topic', 'perc_contribution', 'topic_keywords']

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)


df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data_ready)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['review_num', 'dominant_topic', 'topic_perc_contrib', 'keywords', 'text']
df_dominant_topic.head(4)
Out[98]:
review_num dominant_topic topic_perc_contrib keywords text
0 0 1.0 0.6437 product, buy, look, beautiful, little, well, makeup, work, week, apply [powder, shellac, original, mineral, veil, anything, unusually, oily, skin, creme, eye, color, e...
1 1 4.0 0.9006 scent, time, smell, much, new, last, fine, wear, price, fragrance [cashmere, mistthis, sensual, smell, scent, year, new, favorite]
2 2 2.0 0.4055 receive, smell, soft, dry, feel, way, clean, leave, far, lash [mascara, mascara, lash, look, full, long, receive, several, compliment, already, day]
3 3 3.0 0.4593 hair, color, look, lip, brush, absolutely, perfect, palette, shade, eye [qu, perfect, lip, balm, year, smooth, stay, lip, feel, terrific, container, screw, lid, pop, li...

Create another dataframe with an improved display

In [32]:
# display setting to show more characters in column
pd.options.display.max_colwidth = 100

sent_topics_sorteddf_mallet = pd.DataFrame()
sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('dominant_topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet, 
                                             grp.sort_values(['perc_contribution'], ascending=False)], 
                                            axis=0)

# Reset index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# format
sent_topics_sorteddf_mallet.columns = ['topic_num', "topic_perc_contrib", "keywords", "representative_text"]

# show
sent_topics_sorteddf_mallet.head()
Out[32]:
topic_num topic_perc_contrib keywords representative_text
0 0.0 0.9692 skin, product, face, feel, dry, eye, cream, smooth, look, glow [thai, cystic, acne, prone, skin, tell, moisturizer, toner, oil, free, moisturizer, problem, oil...
1 0.0 0.9634 skin, product, face, feel, dry, eye, cream, smooth, look, glow [provide, protection, sun, give, coverage, skin, look, skin, well, sensitive, combination, skin,...
2 0.0 0.9600 skin, product, face, feel, dry, eye, cream, smooth, look, glow [product, work, leave, face, feel, soft, face, feel, extremely, hydrated, combination, skin, fac...
3 0.0 0.9574 skin, product, face, feel, dry, eye, cream, smooth, look, glow [excellent, serumamaze, product, live, serum, light, absorb, quickly, keep, skin, look, amorepac...
4 0.0 0.9574 skin, product, face, feel, dry, eye, cream, smooth, look, glow [eye, cream, forty, eye, cream, hope, texture, product, notice, difference, quickly, amount, ret...
In [33]:
print(reviews_5.shape)
print(sent_topics_sorteddf_mallet.shape)
(128455, 15)
(128455, 4)
In [34]:
# create an index column in sent_topics_sorteddf_mallet dataframe that will match the index of reviews_5 dataframe (I will use this column to join the two dataframes on). 
sent_topics_sorteddf_mallet['index'] = reviews_5.index
In [35]:
# create an index column matching the index of this dataframe 
reviews_5['index'] = reviews_5.index
In [37]:
df_dominant_topic['index'] = reviews_5.index
df_dominant_topic_2 = df_dominant_topic[['index', 'text']]
df_dominant_topic_2.head()
Out[37]:
index text
0 0 [powder, shellac, original, mineral, veil, anything, unusually, oily, skin, creme, eye, color, e...
1 1 [cashmere, mistthis, sensual, smell, scent, year, new, favorite]
2 3 [mascara, mascara, lash, look, full, long, receive, several, compliment, already, day]
3 5 [qu, perfect, lip, balm, year, smooth, stay, lip, feel, terrific, container, screw, lid, pop, li...
4 7 [moisturizer, hand, fragrance, smell, far, exceed, smell, anywhere, else]

Merge the dataframes

reviews_5 and the dataframe that contains data about the topics (sent_topics_sorteddf_mallet

In [38]:
data_with_topics = pd.merge(reviews_5, sent_topics_sorteddf_mallet, on = 'index')
In [40]:
data_with_topics.shape
Out[40]:
(128455, 20)
In [41]:
data_with_topics = pd.merge(data_with_topics, df_dominant_topic, on = 'index')
In [42]:
data_with_topics.rename(columns = {'Text':'parsed_review'}, inplace = True)
In [43]:
data_with_topics.shape
Out[43]:
(128455, 21)

Save the new dataframe into a csv file

In [44]:
data_with_topics.to_csv('../data/processed_df_topics.csv')

4. Visualizations of Topic Modeling Results

In the following section I won't comment on the visualizations as they look pretty self-exlanatory.

In [46]:
doc_lens = [len(d) for d in df_dominant_topic.text]

# Plot
plt.figure(figsize=(8,5), dpi=160)
plt.hist(doc_lens, bins = 200, color='navy')

plt.gca().set(xlim=(0, 250), ylabel='Number of Reviews', xlabel='Review Word Count')
plt.tick_params(size=16)
plt.xticks(np.linspace(0,250,9))
plt.title('Distribution of Review Word Counts', fontdict=dict(size=22))
plt.show()
In [89]:
# 1. Wordcloud of top n words in each topic
from matplotlib import pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import matplotlib.colors as mcolors

cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]  # more colors: 'mcolors.XKCD_COLORS'

cloud = WordCloud(stopwords=stop_words,
                  background_color='white',
                  width=2500,
                  height=1800,
                  max_words=20,
                  colormap='tab10',
                  color_func=lambda *args, **kwargs: cols[i],
                  prefer_horizontal=1.0)

topics = lda_model.show_topics(formatted=False)

fig, axes = plt.subplots(3, 2, figsize=(17,20), sharex=True, sharey=True)



for i, ax in enumerate(axes.flatten()):
    fig.add_subplot(ax)
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    plt.gca().imshow(cloud)
    plt.gca().set_title('Topic ' + str(i), fontdict=dict(size=16))
    plt.gca().axis('off')
    

plt.subplots_adjust(wspace=0, hspace=0)
plt.axis('off')
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-89-65792afaed57> in <module>
     23 for i, ax in enumerate(axes.flatten()):
     24     fig.add_subplot(ax)
---> 25     topic_words = dict(topics[i][1])
     26     cloud.generate_from_frequencies(topic_words, max_font_size=300)
     27     plt.gca().imshow(cloud)

IndexError: list index out of range
In [93]:
from collections import Counter
topics = lda_model.show_topics(formatted=False)
data_flat = [w for w_list in data_ready for w in w_list]
counter = Counter(data_flat)

out = []
for i, topic in topics:
    for word, weight in topic:
        out.append([word, i , weight, counter[word]])

df = pd.DataFrame(out, columns=['word', 'topic_id', 'importance', 'word_count'])        

# Plot Word Count and Weights of Topic Keywords
fig, axes = plt.subplots(2, 2, figsize=(10,10), sharey=True, dpi=160)
cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]
for i, ax in enumerate(axes.flatten()):
    ax.bar(x='word', height="word_count", data=df.loc[df.topic_id==i, :], color=cols[i], width=0.5, alpha=0.3, label='Word Count')
    ax_twin = ax.twinx()
    ax_twin.bar(x='word', height="importance", data=df.loc[df.topic_id==i, :], color=cols[i], width=0.2, label='Weights')
    ax.set_ylabel('Word Count', color=cols[i])
    ax_twin.set_ylim(0, 0.15); ax.set_ylim(0, 70000)
    ax.set_title('Topic: ' + str(i), color=cols[i], fontsize=16)
    ax.tick_params(axis='y', left=False)
    ax.set_xticklabels(df.loc[df.topic_id==i, 'word'], rotation=30, horizontalalignment= 'right')
    ax.legend(loc='upper left'); ax_twin.legend(loc='upper right')

fig.tight_layout(w_pad=2)    
fig.suptitle('Word Count and Importance of Topic Keywords', fontsize=22, y=1.05)    
plt.show()
In [103]:
# Get topic weights and dominant topics 
from sklearn.manifold import TSNE
from bokeh.plotting import figure, output_file, show
from bokeh.models import Label
from bokeh.io import output_notebook

# Get topic weights
topic_weights = []
for i, row_list in enumerate(lda_model[corpus]):
    topic_weights.append([w for i, w in row_list[0]])

# Array of topic weights    
arr = pd.DataFrame(topic_weights).fillna(0).values

# Keep the well separated points (optional)
arr = arr[np.amax(arr, axis=1) > 0.35]

# Dominant topic number in each doc
topic_num = np.argmax(arr, axis=1)

# tSNE Dimension Reduction
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')
tsne_lda = tsne_model.fit_transform(arr)

# Plot the topic clusters using Bokeh
output_notebook()
n_topics = 5
mycolors = np.array([color for name, color in mcolors.TABLEAU_COLORS.items()])
plot = figure(title="t-SNE Clustering of {} LDA Topics".format(n_topics), 
              plot_width=900, plot_height=700)
plot.scatter(x=tsne_lda[:,0], y=tsne_lda[:,1], color=mycolors[topic_num])
show(plot)
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 113866 samples in 0.050s...
[t-SNE] Computed neighbors for 113866 samples in 5.561s...
[t-SNE] Computed conditional probabilities for sample 1000 / 113866
[t-SNE] Computed conditional probabilities for sample 2000 / 113866
[t-SNE] Computed conditional probabilities for sample 3000 / 113866
[t-SNE] Computed conditional probabilities for sample 4000 / 113866
[t-SNE] Computed conditional probabilities for sample 5000 / 113866
[t-SNE] Computed conditional probabilities for sample 6000 / 113866
[t-SNE] Computed conditional probabilities for sample 7000 / 113866
[t-SNE] Computed conditional probabilities for sample 8000 / 113866
[t-SNE] Computed conditional probabilities for sample 9000 / 113866
[t-SNE] Computed conditional probabilities for sample 10000 / 113866
[t-SNE] Computed conditional probabilities for sample 11000 / 113866
[t-SNE] Computed conditional probabilities for sample 12000 / 113866
[t-SNE] Computed conditional probabilities for sample 13000 / 113866
[t-SNE] Computed conditional probabilities for sample 14000 / 113866
[t-SNE] Computed conditional probabilities for sample 15000 / 113866
[t-SNE] Computed conditional probabilities for sample 16000 / 113866
[t-SNE] Computed conditional probabilities for sample 17000 / 113866
[t-SNE] Computed conditional probabilities for sample 18000 / 113866
[t-SNE] Computed conditional probabilities for sample 19000 / 113866
[t-SNE] Computed conditional probabilities for sample 20000 / 113866
[t-SNE] Computed conditional probabilities for sample 21000 / 113866
[t-SNE] Computed conditional probabilities for sample 22000 / 113866
[t-SNE] Computed conditional probabilities for sample 23000 / 113866
[t-SNE] Computed conditional probabilities for sample 24000 / 113866
[t-SNE] Computed conditional probabilities for sample 25000 / 113866
[t-SNE] Computed conditional probabilities for sample 26000 / 113866
[t-SNE] Computed conditional probabilities for sample 27000 / 113866
[t-SNE] Computed conditional probabilities for sample 28000 / 113866
[t-SNE] Computed conditional probabilities for sample 29000 / 113866
[t-SNE] Computed conditional probabilities for sample 30000 / 113866
[t-SNE] Computed conditional probabilities for sample 31000 / 113866
[t-SNE] Computed conditional probabilities for sample 32000 / 113866
[t-SNE] Computed conditional probabilities for sample 33000 / 113866
[t-SNE] Computed conditional probabilities for sample 34000 / 113866
[t-SNE] Computed conditional probabilities for sample 35000 / 113866
[t-SNE] Computed conditional probabilities for sample 36000 / 113866
[t-SNE] Computed conditional probabilities for sample 37000 / 113866
[t-SNE] Computed conditional probabilities for sample 38000 / 113866
[t-SNE] Computed conditional probabilities for sample 39000 / 113866
[t-SNE] Computed conditional probabilities for sample 40000 / 113866
[t-SNE] Computed conditional probabilities for sample 41000 / 113866
[t-SNE] Computed conditional probabilities for sample 42000 / 113866
[t-SNE] Computed conditional probabilities for sample 43000 / 113866
[t-SNE] Computed conditional probabilities for sample 44000 / 113866
[t-SNE] Computed conditional probabilities for sample 45000 / 113866
[t-SNE] Computed conditional probabilities for sample 46000 / 113866
[t-SNE] Computed conditional probabilities for sample 47000 / 113866
[t-SNE] Computed conditional probabilities for sample 48000 / 113866
[t-SNE] Computed conditional probabilities for sample 49000 / 113866
[t-SNE] Computed conditional probabilities for sample 50000 / 113866
[t-SNE] Computed conditional probabilities for sample 51000 / 113866
[t-SNE] Computed conditional probabilities for sample 52000 / 113866
[t-SNE] Computed conditional probabilities for sample 53000 / 113866
[t-SNE] Computed conditional probabilities for sample 54000 / 113866
[t-SNE] Computed conditional probabilities for sample 55000 / 113866
[t-SNE] Computed conditional probabilities for sample 56000 / 113866
[t-SNE] Computed conditional probabilities for sample 57000 / 113866
[t-SNE] Computed conditional probabilities for sample 58000 / 113866
[t-SNE] Computed conditional probabilities for sample 59000 / 113866
[t-SNE] Computed conditional probabilities for sample 60000 / 113866
[t-SNE] Computed conditional probabilities for sample 61000 / 113866
[t-SNE] Computed conditional probabilities for sample 62000 / 113866
[t-SNE] Computed conditional probabilities for sample 63000 / 113866
[t-SNE] Computed conditional probabilities for sample 64000 / 113866
[t-SNE] Computed conditional probabilities for sample 65000 / 113866
[t-SNE] Computed conditional probabilities for sample 66000 / 113866
[t-SNE] Computed conditional probabilities for sample 67000 / 113866
[t-SNE] Computed conditional probabilities for sample 68000 / 113866
[t-SNE] Computed conditional probabilities for sample 69000 / 113866
[t-SNE] Computed conditional probabilities for sample 70000 / 113866
[t-SNE] Computed conditional probabilities for sample 71000 / 113866
[t-SNE] Computed conditional probabilities for sample 72000 / 113866
[t-SNE] Computed conditional probabilities for sample 73000 / 113866
[t-SNE] Computed conditional probabilities for sample 74000 / 113866
[t-SNE] Computed conditional probabilities for sample 75000 / 113866
[t-SNE] Computed conditional probabilities for sample 76000 / 113866
[t-SNE] Computed conditional probabilities for sample 77000 / 113866
[t-SNE] Computed conditional probabilities for sample 78000 / 113866
[t-SNE] Computed conditional probabilities for sample 79000 / 113866
[t-SNE] Computed conditional probabilities for sample 80000 / 113866
[t-SNE] Computed conditional probabilities for sample 81000 / 113866
[t-SNE] Computed conditional probabilities for sample 82000 / 113866
[t-SNE] Computed conditional probabilities for sample 83000 / 113866
[t-SNE] Computed conditional probabilities for sample 84000 / 113866
[t-SNE] Computed conditional probabilities for sample 85000 / 113866
[t-SNE] Computed conditional probabilities for sample 86000 / 113866
[t-SNE] Computed conditional probabilities for sample 87000 / 113866
[t-SNE] Computed conditional probabilities for sample 88000 / 113866
[t-SNE] Computed conditional probabilities for sample 89000 / 113866
[t-SNE] Computed conditional probabilities for sample 90000 / 113866
[t-SNE] Computed conditional probabilities for sample 91000 / 113866
[t-SNE] Computed conditional probabilities for sample 92000 / 113866
[t-SNE] Computed conditional probabilities for sample 93000 / 113866
[t-SNE] Computed conditional probabilities for sample 94000 / 113866
[t-SNE] Computed conditional probabilities for sample 95000 / 113866
[t-SNE] Computed conditional probabilities for sample 96000 / 113866
[t-SNE] Computed conditional probabilities for sample 97000 / 113866
[t-SNE] Computed conditional probabilities for sample 98000 / 113866
[t-SNE] Computed conditional probabilities for sample 99000 / 113866
[t-SNE] Computed conditional probabilities for sample 100000 / 113866
[t-SNE] Computed conditional probabilities for sample 101000 / 113866
[t-SNE] Computed conditional probabilities for sample 102000 / 113866
[t-SNE] Computed conditional probabilities for sample 103000 / 113866
[t-SNE] Computed conditional probabilities for sample 104000 / 113866
[t-SNE] Computed conditional probabilities for sample 105000 / 113866
[t-SNE] Computed conditional probabilities for sample 106000 / 113866
[t-SNE] Computed conditional probabilities for sample 107000 / 113866
[t-SNE] Computed conditional probabilities for sample 108000 / 113866
[t-SNE] Computed conditional probabilities for sample 109000 / 113866
[t-SNE] Computed conditional probabilities for sample 110000 / 113866
[t-SNE] Computed conditional probabilities for sample 111000 / 113866
[t-SNE] Computed conditional probabilities for sample 112000 / 113866
[t-SNE] Computed conditional probabilities for sample 113000 / 113866
[t-SNE] Computed conditional probabilities for sample 113866 / 113866
[t-SNE] Mean sigma: 0.000153
[t-SNE] KL divergence after 250 iterations with early exaggeration: 94.962189
[t-SNE] KL divergence after 1000 iterations: 2.899112
Loading BokehJS ...
In [102]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary=lda_model.id2word)
vis
Out[102]:

Topics 2 and 4 are overlapping. Fruther tuning of the model is necessary to imrprove topic separation